Abstract

This paper formally studies the question of how much paral- 
lelism is available in call-by-value functional languages with 
no parallel extensions (i.e., the functional subsets of ML or
Scheme). In particular we are interested in placing bounds 
on how much parallelism is available for various problems. 
To do this we introduce a complexity model, the PAL, based 
on the call-by-value λ-calculus. The model is defined in
terms of a profiling semantics and measures complexity in 
terms of the total work and the parallel depth of a com- 
putation. We describe a simulation of the A-PAL (the PAL 
extended with arithmetic operations) on various parallel ma- 
chine models, including the butterfly, hypercube, and PRAM 
models and prove simulation bounds. In particular the sim- 
ulations are work-efficient (the processor-time product on 
the machines is within a constant factor of the work on the 
A-PAL), and for p processors the slowdown (time on the 
machines divided by depth on the A-PAL) is proportional 
to at most O(log p). We also prove bounds for simulating
the PRAM on the A-PAL. 

Based on the model, we describe and analyze tree-based 
versions of quicksort and mergesort. We show that for an 
input of size n these algorithms run on the A-PAL model 
with O(n log n) work and O(log^2 n) depth (expected case for
quicksort). 

1 Introduction 

Many researchers have argued that an important aspect of 
purely functional languages is their inherent parallelism — 
since the languages lack side effects, subexpressions may 
safely be evaluated in parallel. Furthermore, researchers 
have presented many implementation techniques to take ad- 
vantage of this parallelism, including data-flow [28], par- 
allel graph reduction [20, 30], and various compiler tech- 
niques [14]. Such work has suggested that it might not be 
necessary to add explicit parallel constructs to functional
languages to obtain adequate parallelism.

There has been little study, however, of how much par- 
allelism can be achieved for various problems, or how the 
inherent parallelism in functional languages relates to more 
standard models used for analyzing parallel algorithms, such
as the PRAM. For example, what are asymptotic bounds for 
sorting using a parallel implementation of a functional lan- 
guage such as ML or Haskell? What kind of sort would we 
use? How would the bounds compare with parallel sorting 
algorithms designed for various machine models? Does it 
matter whether the language is strict or lazy? Before these 
can be answered, we first need to augment functional lan- 
guages with a formal model of complexity. Furthermore, 
if we want to compare results to previous research on par- 
allel algorithms, we also need to relate this complexity to 
running time on various machine models. This relation needs
to capture some aspects of the parallel implementation of 
the language. To address these issues this paper makes the 
following contributions: 

1. We introduce a parallel model based on the pure
λ-calculus using applicative-order (call-by-value)
evaluation and specified in terms of a profiling
semantics [38, 39]. This semantics defines two measures of
complexity. The work is the total amount of
computation executed by a program. The computational
depth (or simply depth) is the depth of the
computation tree, assuming that the two subexpressions of an
application e1 e2 are evaluated in parallel. The
language is equivalent, within constant factors
in complexity, to the functional subsets of eager
languages such as ML or Scheme when the parallelism in
those languages comes from evaluating arguments in
parallel [7]. This correspondence allows us to use the
simpler λ-calculus to prove results about the
complexity model while using an ML-like language to prove
results about algorithms.

2. We prove results on how the complexities in our model 
relate to complexities of various machine-based mod- 
els, including the PRAM [15], hypercube, and but- 
terfly models. For the PRAM, we examine both the 
concurrent read, concurrent write (CRCW) and con- 
current read, exclusive write (CREW) variants. The 
results are summarized in Figure 1. The proofs in- 
troduce a parallel version of the SECD machine [25], 
the P-ECD machine. A state of the P-ECD machine 
consists of a set of substates, and each state transition 
of the machine transforms this set into a new set of 
substates. On each step the substates are scheduled 
across the processors of the host machine. We also 
prove results for simulating the PRAM model on our 
model. 
Figure 1: The mapping of work (w) and depth (d) in the
proposed model (the A-PAL) to running time on various 
machine models. The number of processors on the machine 
is p. For the randomized algorithms the running times are 
high-probability bounds (i.e., they will run within the speci- 
fied time with very high probability). All the results assume 
that the number of independent variable names in a program 
is constant, as will be discussed in Section 3. For the butterfly
we assume it has p lg p switches, and for the hypercube,
we assume the multiport version (can communicate over all 
wires simultaneously). 

3. We provide examples of analyzing algorithms, specifi- 
cally parallel versions of quicksort and mergesort. Se- 
quences of size n are stored as balanced trees, since for 
sequences stored as a list, any algorithm would require 
Ω(n) depth just to traverse the list. This accentuates
the importance of storing data as trees rather than lists 
to take advantage of parallel implementations of func- 
tional languages. The merging in mergesort borrows 
ideas from algorithms designed for the PRAM [41], 
but has some substantial changes to make up for the 
lack of random access. Both sorting algorithms require 
O(n log n) work and O(log^2 n) depth, and our work
bounds are optimal for both merging and sorting, and 
our depth bounds are optimal for merging. 

Applicative-order evaluation is used instead of
normal-order evaluation because of ambiguities in defining a
formal model based on normal-order evaluation. The
problem is that normal-order evaluation can have a wide range of
implementations, such as call-by-name, call-by-need (lazy),
and call-by-speculation (lenient)¹, and these
implementations would have very different complexity models. The first
two, call-by-name and call-by-need, actually offer no signif- 
icant parallelism [23]. Call-by-speculation offers plenty of 
parallelism but does the same amount of work as applicative- 
order semantics. In particular, a model that uses call-by- 
speculation would give the same asymptotic work bounds 
as our model, although it might be possible to improve 
some depth bounds. Most implementations of lazy lan- 
guages suggested in the literature sit somewhere between 
call-by-need and call-by-speculation. Typically some heuris- 
tic or strictness analysis is used to decide when to use call- 
by-speculation instead of call-by-need, and there is some way 
to garbage collect speculative computations that are never 
needed. In these implementations a complexity model would 
depend critically on what heuristics are used or how good 
the strictness analysis is. An interesting line of future work 
would be to formally compare implementations using their 
complexity models. 

¹ We use the term to mean a fully speculative implementation [19].

One inconvenience with our model is the need to keep
track of how many variable names are needed. In particular,
our simulation bounds need to include the logarithm of the
number of independent variables (v_e) in order to account for
variable lookup. Fortunately it is straightforward to show
that the number of variables for algorithms such as sorting
is independent of the size of the input, so that v_e does not
affect the asymptotic bounds. Another choice would be to
restrict the λ-calculus to only allow a constant number of
variables. This, however, would require that we choose a
particular constant and then show how to convert programs
with more variables into this fixed constant number.

The paper is organized as follows. Section 2 describes 
the model and Sections 3 and 4 relate the model to various 
machine models. Section 5 gives algorithms for sorting and 
merging. Section 6 discusses related work. 

2 The PAL Model 

Our model is based on the untyped λ-calculus using an
applicative-order (call-by-value) operational semantics that 
is augmented with complexity measures. We chose the
λ-calculus rather than a specific language since its simplicity
makes the simulation results in Section 3 much cleaner, and 
many features of modern languages (e.g., data-types, con- 
ditionals, recursion, and local variables) can be simulated 
with constant overhead [7], therefore not affecting asymp- 
totic performance. The abstract syntax of the model is 

e ∈ Expressions ::= c | x | λx.e | e1 e2

where the meta-variable c ranges over a set of constants. 
We refer to the pure version with no constants as the
parallel applicative λ-calculus (PAL) model. For the sake of
practicality, we also consider a model that includes a set 
of arithmetic constants (the integers along with some in- 
teger operators). We refer to this extended version as the 
Arithmetic-PAL (A-PAL) model. The A-PAL model can 
be simulated on the PAL with costs polylogarithmic in the 
integer range. 

In the applicative-order λ-calculus the function and
argument can always be evaluated in parallel, and this is the
only form of parallelism we consider in this paper. To ac- 
count for this parallelism our model tracks two complexity 
measures, the total work executed by a computation and the 
parallel depth of the computation. When evaluating an
expression e1 e2 the work of the computation is the sum of the
work required to evaluate e1 and e2 plus the work needed to
apply the result of e1 to the result of e2. The depth of the
computation is the maximum of the depths of evaluating e1
and e2, plus the depth of applying the result of e1 to the
result of e2. We keep track of the work in addition to the
depth for the purpose of proving useful simulation bounds 
on parallel machines that have a fixed number of processors. 

We formalize the work and depth complexities in terms 
of a profiling semantics [38, 39], which extends the standard 
operational semantics with cost measures. The judgment
E ⊢ e → v; w, d reads as "In the environment E, the
expression e evaluates to value v in work w and depth d."
This relation is defined by the rules in Figure 2. 

When evaluating a program, we start with an empty
environment []. The extension of an environment with a
variable and associated value is denoted by E[x ↦ v], where x
may already be bound in E. If E has a binding for x, the associated value
is denoted by E(x).

The APP and APPC rules show how work is combined
with addition and depth with maximum. The constant 2
in these rules is used to make an exact correspondence
between work and depth and the states processed in our
simulations (see Section 3). Otherwise the constants do not
matter since we are interested in asymptotic analysis.
Program constants, λ-expressions, and variables are assumed to
evaluate with constant work and depth. As usual, program
constants evaluate to themselves, λ-expressions evaluate to
closures, and the value of variables is determined by the
current environment. Applying a constant function is also
assumed to evaluate with constant work and depth. This
is a reasonable assumption for most constant functions,
including those used here. It is straightforward, however, to
augment the model with constant functions whose work and
depth are a function of their argument [7].

Figure 2: The profiling semantics of the PAL model.

Definition 1 The PAL model is the λ-calculus with no
constants and with the semantics defined by E ⊢ e → v; w, d.
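
The rules of Figure 2 are not reproduced in this text. The following is a hedged reconstruction, inferred from the costs used in the proof of Lemma 3 (unit work and depth for constants, λ-expressions, and variables, and an additive constant 2 for applications). The layout of the rules and the names E', e', and v_2 are assumptions, not taken from the original figure.

```latex
\begin{gather*}
\frac{}{E \vdash c \to c;\ 1,\ 1}\ (\text{CONST})
\qquad
\frac{}{E \vdash \lambda x.e \to cl(E, x, e);\ 1,\ 1}\ (\text{LAM})
\qquad
\frac{E(x) = v}{E \vdash x \to v;\ 1,\ 1}\ (\text{VAR})
\\[1ex]
\frac{E \vdash e_1 \to cl(E', x, e');\ w_1, d_1
      \quad E \vdash e_2 \to v_2;\ w_2, d_2
      \quad E'[x \mapsto v_2] \vdash e' \to v;\ w_3, d_3}
     {E \vdash e_1\ e_2 \to v;\ w_1 + w_2 + w_3 + 2,\ \max(d_1, d_2) + d_3 + 2}\ (\text{APP})
\\[1ex]
\frac{E \vdash e_1 \to c;\ w_1, d_1
      \quad E \vdash e_2 \to v_2;\ w_2, d_2
      \quad \delta(c, v_2) = v}
     {E \vdash e_1\ e_2 \to v;\ w_1 + w_2 + 2,\ \max(d_1, d_2) + 2}\ (\text{APPC})
\end{gather*}
```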

Adding Constants to the PAL Model 

We now extend the basic PAL model with arithmetic con- 
stants to obtain the Arithmetic-PAL model. These con- 
stants can be simulated in the base model, but this would 
require polylogarithmic overheads in both work and depth. 
The constants are 



c ∈ Constants ::= ... | i | add | add_i | mul | mul_i | neg | div2 | pos?



where i ranges over the integers. The primitive functions 
are addition, multiplication, negation, division by two, and 
the test for positive integers. For syntactic simplicity, all 
primitive functions are curried. The choice of primitives is 
not important, but for the purpose of lower bounds proofs 
they should be incompressible [2], which ensures that certain 
kinds of data encoding schemes cannot asymptotically im- 
prove complexity bounds, e.g., encoding arrays as integers. 
This is why general division has been omitted. 

The δ functions for these constants are given in Figure 3.
The two closures in the δ-rule for pos? are standard
encodings for the booleans and can be used to encode
conditionals [7]. Applying each of these constants requires constant
work.

  δ(add, i) = add_i          δ(mul, i) = mul_i
  δ(add_i, i') = i + i'      δ(mul_i, i') = i × i'
  δ(neg, i) = -i             δ(div2, i) = ⌊i/2⌋
  δ(pos?, i) = if i > 0 then cl([], x, λy.x)
                        else cl([], x, λy.y)

Figure 3: The δ functions for the A-PAL model.

Definition 2 The A-PAL model is the λ-calculus with the
constants i, add, add_i, mul, mul_i, neg, div2, and pos?
and with the semantics defined by E ⊢ e → v; w, d.
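
As an illustration of the profiling semantics, the following Python sketch evaluates A-PAL terms and returns the value together with its work and depth. It is a minimal sketch under stated assumptions: the tuple encoding of terms, the dictionary environments, and the helper names (`delta`, `evaluate`, `c`, `app`) are hypothetical, and the cost constants (1 for constants, variables, and λ-expressions; an additive 2 for applications) are the ones used in the proof of Lemma 3 rather than being read off Figure 2.

```python
from typing import Any, Dict, Tuple

# Terms:  ("const", c) | ("var", x) | ("lam", x, body) | ("app", e1, e2)
# Values: int | primitive name (str) | ("part", op, i) | ("cl", env, x, body)

def delta(c: Any, v: Any) -> Any:
    """Apply primitive c to value v, following the delta rules of Figure 3."""
    if c in ("add", "mul"):
        return ("part", c, v)                      # add_i / mul_i
    if isinstance(c, tuple) and c[0] == "part":
        return c[2] + v if c[1] == "add" else c[2] * v
    if c == "neg":
        return -v
    if c == "div2":
        return v // 2                              # floor(i / 2)
    if c == "pos?":                                # booleans encoded as closures
        body = ("var", "x") if v > 0 else ("var", "y")
        return ("cl", {}, "x", ("lam", "y", body))
    raise ValueError(f"not a primitive function: {c!r}")

def evaluate(env: Dict[str, Any], e: tuple) -> Tuple[Any, int, int]:
    """Return (value, work, depth) for term e in environment env."""
    tag = e[0]
    if tag == "const":
        return e[1], 1, 1
    if tag == "var":
        return env[e[1]], 1, 1
    if tag == "lam":
        return ("cl", env, e[1], e[2]), 1, 1
    # Application: function and argument are conceptually evaluated in parallel,
    # so work adds up while depth takes the maximum of the two branches.
    f, w1, d1 = evaluate(env, e[1])
    a, w2, d2 = evaluate(env, e[2])
    if isinstance(f, tuple) and f[0] == "cl":      # APP
        _, cenv, x, body = f
        v, w3, d3 = evaluate({**cenv, x: a}, body)
        return v, w1 + w2 + w3 + 2, max(d1, d2) + d3 + 2
    return delta(f, a), w1 + w2 + 2, max(d1, d2) + 2   # APPC

def c(x):
    return ("const", x)

def app(f, *args):
    """Curried application: app(f, a, b) builds ((f a) b)."""
    for a in args:
        f = ("app", f, a)
    return f

example = app(c("add"), app(c("add"), c(1), c(2)), app(c("add"), c(3), c(4)))
print(evaluate({}, example))   # (10, 19, 9): value, work, depth
```

Under these assumed constants, the two inner additions each cost 7 work and 5 depth; because they sit in different branches of the outer applications, their depths overlap while their work adds up.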

3 Simulating the A-PAL on Various Machines 

In this section we give simulation bounds for simulating the 
A-PAL model (or PAL) on various machine models. 

We first describe the simulation on a serial RAM and 
then extend this for the simulation on a PRAM, butterfly 
network, and hypercube. To simulate the A-PAL on the 
RAM, we use a variant of the SECD machine [25, 31] as 
an intermediate step. We first show how the work complex- 
ity of an A-PAL program is related to the number of state 
transitions of the SECD machine and then show that each 
transition can be implemented within given bounds. For the 
parallel simulations of the A-PAL, we introduce a parallel 
variant of the SECD machine, the Parallel ECD (P-ECD) 
machine. The basic idea of the P-ECD machine is that it 
keeps a set of substates that can be evaluated in parallel. A 
state transition causes each substate to convert into either 
0, 1, or 2 new substates, so the number of substates will 
vary over the computation. We show that the work
complexity of a program is equal to the total number of
substates processed and that the depth complexity is exactly
equal to the number of steps taken by the P-ECD machine. 
We then show using an appropriate scheduling how this can 
be mapped onto various machines with a fixed number of 
processors. 

The SECD machine is a state machine with transition
function ⇒, where a state (S, E, C, D) consists of a data
stack S of values, an environment E, a control list C of
expressions or the symbol @ (apply), and a "dump" D which
is a list of (S, E, C) triples used as a control stack to return
from function calls. To evaluate an expression e, the
machine starts in the state (nil, nil, [e], nil). It halts when S is
a singleton and both C and D are nil, with the result being
the singleton value in S.

Using the SECD machine, the mapping between work in 
the A-PAL model and time on a RAM can be split into two 
simpler parts: the mapping of work in the A-PAL to the 
number of states in a SECD machine transition sequence, 
and the mapping of this to time on a RAM. 

Lemma 1 If [] ⊢ e → v; w, d, then the SECD machine
evaluates e to v in a transition sequence of w states.

Proof outline: Generalizing the lemma to all environments,
we can show by structural induction on the A-PAL
evaluation derivation that if E ⊢ e → v; w, d, then the transition
sequence (S, E, e :: C, D) ⇒* (v :: S, E, C, D) involves w
states. The full proof can be found in [7]. □

In our variant of the SECD machine, environments are 
represented as balanced trees, such as AVL trees. Extend- 
ing an environment creates a new environment sharing as 
much structure with the old environment as possible. In 
particular, extending an environment with a variable name 
already in the environment creates a new environment with 
only that binding changed, so that the new environment is 
no larger than the old. As a result, no environment
created during the evaluation of an expression e contains more
than the number of variables in e. In the worst case this is
equal to the number of λ-expressions since each could have
its own variable name, but we assume without loss of
generality that names are shared among λ's where it does not
cause a conflict. In practice v_e, the logarithm of the number
of variables in e, is a small constant that is independent of
the data size; it is easy to share names in all common data
representations.
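
The environment representation described above can be sketched as a persistent (path-copying) AVL tree. This is an illustrative sketch only: the paper names AVL trees as one possible balanced tree but gives no code, so the `Node` layout and the function names `extend`, `lookup`, and `size` are assumptions. Extension copies only the nodes on the search path and shares the rest with the old environment, and rebinding an existing name leaves the size unchanged.

```python
from typing import Any, NamedTuple, Optional

class Node(NamedTuple):
    key: str
    val: Any
    left: Optional["Node"]
    right: Optional["Node"]
    height: int

def height(t: Optional[Node]) -> int:
    return t.height if t is not None else 0

def node(k, v, l, r) -> Node:
    return Node(k, v, l, r, 1 + max(height(l), height(r)))

def rotate_right(t: Node) -> Node:
    l = t.left
    return node(l.key, l.val, l.left, node(t.key, t.val, l.right, t.right))

def rotate_left(t: Node) -> Node:
    r = t.right
    return node(r.key, r.val, node(t.key, t.val, t.left, r.left), r.right)

def balance(k, v, l, r) -> Node:
    """Build a node from (k, v, l, r), restoring the AVL invariant if needed."""
    if height(l) > height(r) + 1:
        if height(l.left) < height(l.right):       # left-right case
            l = rotate_left(l)
        return rotate_right(node(k, v, l, r))
    if height(r) > height(l) + 1:
        if height(r.right) < height(r.left):       # right-left case
            r = rotate_right(r)
        return rotate_left(node(k, v, l, r))
    return node(k, v, l, r)

def extend(t: Optional[Node], k: str, v: Any) -> Node:
    """Return a new environment with k bound to v; t itself is not modified."""
    if t is None:
        return node(k, v, None, None)
    if k == t.key:                                 # rebinding: same shape, same size
        return node(k, v, t.left, t.right)
    if k < t.key:
        return balance(t.key, t.val, extend(t.left, k, v), t.right)
    return balance(t.key, t.val, t.left, extend(t.right, k, v))

def lookup(t: Optional[Node], k: str) -> Any:
    """Find the value bound to k in time proportional to the tree height."""
    while t is not None:
        if k == t.key:
            return t.val
        t = t.left if k < t.key else t.right
    raise KeyError(k)

def size(t: Optional[Node]) -> int:
    return 0 if t is None else 1 + size(t.left) + size(t.right)
```

Both `extend` and `lookup` touch one root-to-leaf path, which is the logarithmic cost that Lemma 2 charges per transition.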

Lemma 2 The SECD transition on state (S, E, C, D) can
be simulated on a RAM in no more than k lg |E| time, for
some constant k, where |E| is the number of variables in E.

Proof outline: All transitions except for environment lookup 
and environment extension can be implemented with simple 
list manipulations or primitive arithmetic operations and 
take constant time (this assumes the RAM supports the 
same arithmetic operations as the A-PAL). The balanced 
trees representing environments allow lookup and extension 
in logarithmic time. □ 

Corollary 1 Each SECD transition in the evaluation of e
from the empty environment can be simulated on a RAM in
no more than k v_e time, for some constant k.



Proof outline: Follows from v_e bounding the depth of each
environment in the evaluation. □

We note that Lemma 2 holds for a pointer machine [24, 
44, 2] as well as a RAM since the simulation does not require 
random access to memory. 

Theorem 1 If [] ⊢ e → v; w, d, then a RAM can calculate
v from e in no more than k v_e w time, for some constant k.

Proof: Follows from Lemma 1 and Corollary 1. □ 



Simulating the A-PAL on the P-ECD 

For the parallel simulation we introduce the P-ECD ma- 
chine. Again the simulation can be split into relating the 
complexity of the A-PAL to the number of state transitions 
of the P-ECD, and then we can bound the time to execute 
each transition on various parallel machines. 

Each step of the P-ECD machine transforms the current 
state (Q, M) into a new state. The array Q of substates 
describes the subexpressions being evaluated, and the array 
M describes the partial results obtained so far, taking the 
place of the data stack in the SECD machine. Each element 
of M contains zero (noval) or one (val(v)) partial result.

To execute a step we process all the substates (E, C, D) 
in parallel. Processing of each substate consists of executing 
three transitions. At the beginning of the step, each substate 
consists of environment E, a balanced tree as in our SECD 
machine; control C, a single expression to be evaluated; and 
dump D, a description of how this computation is to commu- 
nicate its results. After the three transitions each substate 
results in zero, one, or two new (E, C, D) substates. The 
combination of all these new substates makes up the new Q, 
thus the size of Q can vary over time. The P-ECD machine 
starts with one substate (nil, e, nil), where e is the program
to be evaluated, and exits when a substate reaches a special
Exit(v) substate, where v is the program result (this can
only happen when it is the only substate left). We are also 
guaranteed that a step results in no new substates if and 
only if the computation is finishing. 

The three transitions of a step, eval, valf, and vala, are 
defined in Figure 4. The eval substep may create an
intermediate substate res(v, D) containing the value of this
subcomputation which is then communicated by one of valf 
and vala. These two substeps coordinate the intermediate 
results obtained from evaluating functions and arguments, 
so the processors must synchronize between these latter sub- 
steps. Array M can be side-effected by the substeps: eval 
can extend the array, and valf and vala can update its con- 
tents. 

We now argue informally why the machine works. The 
interesting transitions are eval on applications and the non- 
identity valf and vala transitions. This eval transition cre- 
ates two new substates, one each to evaluate the function 
and argument. The index i added to the dump D is guaran- 
teed to be independent for each substate processed (e.g., the 
processor ID plus the number of substates processed in pre- 
vious steps) and is used as an index into M. Whichever cal- 
culation completes first writes its result into M, and returns 
no substates. Whenever the second calculation completes, 
it reads the result from M, and initiates the application of 
v1 to v2. In the case that the two branches complete on the

eval transitions:

  E, c, D                   ⇒  res(c, D)                          constant
  E, λx.e, D                ⇒  res(cl(E, x, e), D)                lambda
  E, x, D                   ⇒  res(E(x), D)                       variable
  E, e1 e2, D               ⇒  M_i := noval;                      apply
                               2S((E, e1, fn(E, i) :: D),
                                  (E, e2, arg(E, i) :: D))
                               where i is new
  E', @(cl(E, x, e), v), D  ⇒  1S((E[x ↦ v], e, D))               func-call
  E, @(c, v), D             ⇒  res(δ(c, v), D)                    prim-call

valf transitions:

  res(v, nil)               ⇒  Exit(v)                            exit
  res(v, fn(E, i) :: D)     ⇒  case M_i of                        left-return
                                 val(v') → 1S((E, @(v, v'), D))
                                 noval   → M_i := val(v); 0S

vala transitions:

  res(v, arg(E, i) :: D)    ⇒  case M_i of                        right-return
                                 val(v') → 1S((E, @(v', v), D))
                                 noval   → M_i := val(v); 0S

Otherwise, valf and vala are identities.

A step applies eval, then valf, then vala to each substate:
Q_j ⇒(eval) Q'_j ⇒(valf) Q''_j ⇒(vala) Q'''_j, for each j ∈ {1, ..., |Q|}.

Figure 4: Transitions on the substates of the P-ECD. On each step, each substate leads to zero, one, or two new substates
(0S, 1S, or 2S) for the following step. Semicolons are used to sequence a group of statements.
Figure 5: P-ECD example evaluation using the expression
add (add 1 2) (add 3 4). The total work is the sum over 
all steps of the lengths of Q. 

same step, we guarantee that they both do not believe that 
the other is still running by synchronizing between the valf 
and vala phases. (With an atomic test-and-set,
synchronizing could be avoided.)

As an example of the execution of the P-ECD, Figure 5 
shows Q at the beginning of each step of evaluating the 
expression (add (add 1 2) (add 3 4)). 
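
The step structure of the P-ECD machine can be sketched as a sequential Python simulation. This is a hedged illustration rather than the authors' implementation: the tuple encodings, the use of `None` for the empty dump, plain dictionaries for environments, and the names `eval_substep`, `return_substep`, and `run` are all assumptions. Only `add` and `neg` are modelled as primitives, and the fresh index i is taken to be the current length of M.

```python
NOVAL = object()   # marks a cell of M that holds no partial result yet

def delta(c, v):
    """Primitive application; only add and neg are modelled in this sketch."""
    if c == "add":
        return ("add_i", v)
    if isinstance(c, tuple) and c[0] == "add_i":
        return c[1] + v
    if c == "neg":
        return -v
    raise ValueError(f"not a primitive function: {c!r}")

def eval_substep(sub, M):
    """eval transition: yields ('res', v, D) or ('subs', [new substates])."""
    env, ctl, D = sub
    tag = ctl[0]
    if tag == "const":
        return ("res", ctl[1], D)
    if tag == "lam":
        return ("res", ("cl", env, ctl[1], ctl[2]), D)
    if tag == "var":
        return ("res", env[ctl[1]], D)
    if tag == "app":                                # apply: fork two substates
        i = len(M)
        M.append(NOVAL)                             # M_i := noval, i is new
        return ("subs", [(env, ctl[1], (("fn", env, i), D)),
                         (env, ctl[2], (("arg", env, i), D))])
    f, v = ctl[1], ctl[2]                           # tag == "at", i.e. @(f, v)
    if isinstance(f, tuple) and f[0] == "cl":       # func-call
        return ("subs", [({**f[1], f[2]: v}, f[3], D)])
    return ("res", delta(f, v), D)                  # prim-call

def return_substep(kind, r, M):
    """valf (kind == 'fn') or vala (kind == 'arg'); identity on everything else."""
    if r[0] != "res":
        return r
    _, v, D = r
    if D is None:
        return ("exit", v) if kind == "fn" else r
    (k, env, i), rest = D
    if k != kind:
        return r
    if M[i] is NOVAL:                               # sibling has not finished yet
        M[i] = v
        return ("subs", [])
    f, a = (v, M[i]) if kind == "fn" else (M[i], v)
    return ("subs", [(env, ("at", f, a), rest)])

def run(e):
    """Return (value, number of steps, total substates processed)."""
    Q, M, steps, processed = [({}, e, None)], [], 0, 0
    while True:
        steps += 1
        processed += len(Q)
        rs = [eval_substep(s, M) for s in Q]
        rs = [return_substep("fn", r, M) for r in rs]    # valf phase
        rs = [return_substep("arg", r, M) for r in rs]   # after sync: vala phase
        for r in rs:
            if r[0] == "exit":
                return r[1], steps, processed
        Q = [s for r in rs for s in r[1]]

def c(x):
    return ("const", x)

def app(f, *args):
    for a in args:
        f = ("app", f, a)
    return f

example = app(c("add"), app(c("add"), c(1), c(2)), app(c("add"), c(3), c(4)))
print(run(example))   # (10, 9, 19): value, steps, substates processed
```

Running all valf transitions before any vala transition mirrors the synchronization between the two phases: when a function and its argument finish on the same step, the function side writes M_i first and the argument side then finds it.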

Lemma 3 For all expressions e, if there exists a value v
such that [] ⊢ e → v; w, d, then v is calculated from e
using d steps of a P-ECD machine. Furthermore, the P-ECD
calculation processes a total of w substates.



Proof: We prove that the number of steps taken by the
P-ECD machine is d by induction on the structure of the
A-PAL evaluation derivation. The induction hypothesis is
that if E ⊢ e → v; w, d and the P-ECD machine at step s is
in a state (Q, M) such that substate (E, e, D) is in Q, then
an instance of the eval substep of step s + d - 1 results in
res(v, D).

CONST, LAM, or VAR: The current eval substep results
in res(v, D). By the profiling semantics, d = 1, so the
hypothesis is true.

APP: By eval, two substates (E, e1, D1) and (E, e2, D2) are
created after one step. By the induction hypothesis,
e1 completes after d1 steps, and e2 completes after d2
steps. If the calculation for e1 completes before the
calculation for e2 (i.e., d1 < d2), then when e2 completes,
(E, @(v1, v2), D) is in Q at step s + d2 + 1. Otherwise,
when e1 completes, (E, @(v1, v2), D) is in Q at step
s + d1 + 1. Therefore, (E', @(cl(E, x, e'), v2), D) is in Q
at step s + max(d1, d2) + 1. At the beginning of the next
step, s + max(d1, d2) + 2, the substate (E[x ↦ v2], e', D)
is in Q. By the induction hypothesis, an instance of
the eval substep of step (s + max(d1, d2) + 2) + d3 - 1
results in res(v, D). Since the profiling semantics shows
that d = max(d1, d2) + d3 + 2, this gives the desired
result.

APPC: The argument is similar to the previous rule,
except that at the beginning of step s + max(d1, d2) + 1
the substate (E, @(c, v2), D) is in Q, and an instance
of the eval substep results in res(v, D).

Now we show that the calculation processes w substates,
using induction on the A-PAL derivation.

CONST, LAM, or VAR: Exactly one P-ECD substate is
processed for each of these A-PAL rules.

APP: By induction, computing e1, e2, and e' processes w1,
w2, and w3 substates, respectively. In addition, one
substate with expression e1 e2 and one with expression
@(cl(E, x, e'), v) are processed, so the total processed
is w = w1 + w2 + w3 + 2.

APPC: By induction, computing e1 and e2 processes w1
and w2 substates, respectively. In addition, one
substate with expression e1 e2 and one with expression
@(c, v) are processed, so the total processed is
w = w1 + w2 + 2.

□ 



Simulating the A-PAL on other Parallel Machines 

We now need to show how to simulate the P-ECD machine 
on a PRAM, butterfly network, and hypercube. For the
butterfly we assume that for p processors we have p lg p switches
and p memory banks, and that memory references can be
pipelined through the switches. On such a machine each
of the p processors can access (read or write) n elements
in O(n + log p) time with high probability [27, 33]. The
O(log p) time is due to latency through the network. We
also assume the butterfly network has simple integer adders
in the switches, such that a prefix-sum computation can
execute in O(log p) time. A separate prefix tree, such as on
the CM-5, would also be adequate. For the hypercube we
assume a multiport hypercube in which messages can cross
all wires on each time step, and for which there are separate
queues for each wire. This model is quite similar to the
butterfly and has the same bounds for simulating shared memory.
However, we do not need to assume that the switches have 
integer adders. As in the previous models, we assume that 
primitive function calls can be implemented in constant time 
on a single processor. 

Lemma 4 Each step of the P-ECD machine with q
substates can be processed on a p processor machine within the
following time bounds:

  Machine Model          Time
  CREW PRAM              k v_e (⌈q/p⌉ + log p)
  CRCW PRAM              k v_e (⌈q/p⌉ + log log p)
  CRCW PRAM (rand.)      k v_e (⌈q/p⌉ + log* p)
  Butterfly (rand.)      k v_e (⌈q/p⌉ + log p)
  Hypercube (rand.)      k v_e (⌈q/p⌉ + log p)

for some constant k, where the bounds on randomized
machines hold with high probability.

Proof: For the simulation we keep the substates returned
by each step in an array. If this substate array is of size q,
each processor is responsible for q/p elements of the array
(i.e., processor i is responsible for the elements
[iq/p, ..., (i+1)q/p - 1]). We assume each processor knows its own
processor number, so it can calculate a pointer to its section
of the array. For the CREW and butterfly simulations the
size of the array is exactly q. For the CRCW PRAM
simulations the array can have holes in it that don't contain
substates, as explained below. These holes are marked, and
we guarantee that the total length of the array is at most
kq for some constant k. This means that each processor is
responsible for at most kq/p elements.

The simulation of a step consists of the following sub- 
steps: 

1. Locally evaluating the substates using the eval transi- 
tion in Figure 4. This requires accessing shared mem- 
ory for reading but requires no communication among 
the substates. 

2. Evaluating the valf and vala transitions. This requires
a synchronization between the two transitions. Each
processor first applies the valf transitions for all the
substates for which it is responsible. The processors
then synchronize, and then each processor applies the
vala transitions.

3. Creating a new substate array for the next step. After 
the substep transitions, each array element contains 
zero, one, or two substates (0S, 1S, or 2S), and these
must be distributed into the new array. 

We need to show that each of these substeps can be executed in
the given bounds. The first substep requires the time it takes
to process q/p substates. The eval transition is similar to
the eval for the serial SECD machine. The only real
difference is the apply transition. Each of the other substate
transitions requires the v_e time that was required in the
serial machine and can have at most v_e memory references.
The apply transition can also be executed in these bounds
since it just requires an additional memory write. We can
generate the independent i's simply by using the array index
for the substate added to an offset which gets reset on each
round. None of the memory references require concurrent
writes. The time for the first substep on the CREW and
CRCW PRAM is therefore q/p. The time on the butterfly
and hypercube is q/p + lg p since the memory references
require a lg p latency through the network. The second substep
can also be executed in the same bounds.

The third step requires generating a new substate array. 
Each transitioned substate of the old array contains zero, 
one, or two substates, which need to be distributed into a 
new array for the next step. For the CREW PRAM and 
butterfly this can be done by executing a prefix-sum on the 
number of new substates and using the result as an offset 
into the new array. In both cases for p processors the prefix 
sum and writing into the new array can run in O(q/p + log p)
time. This will give a new array that is exactly the length 
of the number of new substates. On the CRCW PRAM the 
distribution into the new array can be done more efficiently 
using a solution to the linear approximate compaction prob- 
lem [26]: given an array of n cells, m of which contain an 
object, place the m objects in distinct cells of an array of 
size km for some constant k > 1. The idea is to first allocate 
two new positions for each substate, mark the substates that 
will remain (neither for 0S, one for 1S, and both for 2S)
and then do an approximate compaction. Since the result 
array is a constant times larger than the total number of re- 
maining substates, we will maintain the invariant mentioned 
earlier. Gil, Matias, and Vishkin [16] have shown that the 
linear approximate compaction problem can be solved on a 
p processor CRCW PRAM (arbitrary) in O(n/p + log* p)
expected time (using a randomized solution). Goldberg and
Zwick [17] have recently shown that the problem can be
solved deterministically in O(n/p + log log p) time.
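
The prefix-sum distribution used for the CREW PRAM and butterfly can be sketched as follows. This is a sequential Python illustration under stated assumptions: the function name `compact` and the list-of-lists input are hypothetical, and the loop that writes the results stands in for what each processor would do in parallel on its own section of the array. The approximate-compaction variant for the CRCW PRAM is not shown.

```python
from itertools import accumulate

def compact(slots):
    """Pack the 0, 1, or 2 substates of every slot into one dense array.

    slots[j] is the list of new substates produced by old substate j.
    """
    counts = [len(s) for s in slots]
    # Exclusive prefix sum: offsets[j] is where slot j starts in the new array.
    offsets = [0] + list(accumulate(counts))[:-1]
    new = [None] * sum(counts)
    for s, off in zip(slots, offsets):     # independent writes, one group per slot
        for j, sub in enumerate(s):
            new[off + j] = sub
    return new

print(compact([["a"], [], ["b", "c"], ["d"]]))   # ['a', 'b', 'c', 'd']
```

Because every slot writes to a disjoint range determined by its offset, no concurrent writes are needed, and the resulting array has exactly as many cells as there are new substates.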

When we add the times for the three substeps, we get 
the stated bounds for each of the machines. □ 



Theorem 2 If [] ⊢ e → v; w, d, then v can be calculated
from e on a CREW PRAM with p processors in k v_e (w/p +
d log p) time, for some constant k. Analogous results are
true for the other models.

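Proof: The proof uses Brent's scheduling principle [9]. We
prove it for the CREW PRAM, but the other proofs are
almost identical. We assume that step i of the P-ECD
processes q_i substates. We know from Lemma 3 that the q_i,
summed over all steps 0 ≤ i < d, total w. We also know from
Lemma 4 that it takes k' v_e (⌈q_i/p⌉ + log p) time to process
step i. The total time to process all substates is then

$$
\begin{aligned}
T &= \sum_{i=0}^{d-1} k' v_e \left(\lceil q_i/p \rceil + \log p\right) \\
  &\le k' v_e \sum_{i=0}^{d-1} \left(q_i/p + 1 + \log p\right) \\
  &= k' v_e \left(\Big(\sum_{i=0}^{d-1} q_i/p\Big) + d\,(1 + \log p)\right) \\
  &= k' v_e \left(w/p + d\,(1 + \log p)\right) \\
  &\le 2 k' v_e \left(w/p + d \log p\right) \\
  &= k v_e \left(w/p + d \log p\right)
\end{aligned}
$$

where we have set k = 2k' and used log p ≥ 1 (that is, p ≥ 2)
in the last inequality. □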
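To make the chain of inequalities concrete, consider a small hypothetical run (the numbers are illustrative and do not come from the paper): p = 4 processors and d = 3 steps that process q_0 = 8, q_1 = 3, and q_2 = 1 substates, so w = 12. Dropping the common factor k' v_e and taking logarithms base 2:

```latex
\begin{aligned}
\sum_{i=0}^{2}\left(\lceil q_i/p\rceil + \log p\right)
  &= (2+2) + (1+2) + (1+2) = 10,\\
w/p + d\,(1+\log p) &= 3 + 3\cdot 3 = 12,\\
2\,(w/p + d\log p) &= 2\,(3 + 6) = 18.
\end{aligned}
```

The three values satisfy 10 ≤ 12 ≤ 18, in line with the successive bounds in the proof.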



4 Simulating a PRAM on an A-PAL 

In this section we consider simulating a PRAM on an A- 
PAL. The simulation we use gives the same results for the 
EREW, CREW, and CRCW PRAM as well as for the multi- 
prefix [32] and scan models [4]. The simulation is optimal in 
terms of work for all the PRAM variants. This is because it 
takes logarithmic work to simulate each random access into 
memory (this is the same as for pointer machines [2]). Since 
we don't know how to do better for the weaker models, we 
will base our results on the most powerful model, the CRCW 
PRAM with unit-time multiprefix sums (MP PRAM). 

Theorem 3 A program that runs in time t on a p processor
MP PRAM using m memory can be simulated on the A-PAL
model with k_w t p log m work and k_d t log m log p depth,
for some constants k_w and k_d.

Proof: We will simulate a PRAM based on state transitions
on the state (C, M, P) where C is the code, M is the memory,
and P is the state of all the processors (i.e., registers and
program counter). Let c = |C|, m = |M|, and p = |P|. We
assume C, M, and P are stored as balanced binary trees and
that p < m and c < m. Each state transition corresponds
to a step of the PRAM, and the processors will be strictly
synchronous. Register-to-register instructions can be
implemented with O(p) work and O(log p) depth, and concurrent
reads with O(p log m) work and O(log m) depth. This just
requires traversing the appropriate trees. The writes are the
only interesting instruction to implement, and can be
implemented by sorting the write requests from the processors by
address and then recursively splitting the requests at each
node of M as we insert them. We can sort the p requests
in O(p log p) work and O(log^2 p) depth as discussed in the
next section. We assume the sorted requests, which we call
the write-tree, start out balanced and are sorted from left
to right in the tree. To implement a concurrent write or
multiprefix, we combine nodes in the write-tree that have
the same address. Since the addresses are sorted this can be
done in O(p) work and O(log p) depth.

We now consider the insertion of the sorted requests of a
write-tree W into memory M (modify(M, W)). We assume
that M stores the addresses and associated values at the 
leaves, ordering the addresses from left-to-right, and that 
the internal nodes contain the value of the greatest address 
in the left branch. We assume all addresses in W are also 
in M, and that each node of W stores the minimum and 
maximum address of its descendants, so that we can access 
these in constant work and depth. To insert W into M, 
we first check if M is a single node, in which case W must 
also be a single node, and we simply modify the value and 
return. Otherwise, we check if all the addresses in W belong 
to just one of the branches of the M tree. If so, we call 
modify recursively on that branch of M with the same W 
and put the result back together with the other branch of 
M when the call returns. If not, we split W based on the 
address stored at the root of M and call modify in parallel 
on the two children of M and the two split parts of W . This 
algorithm works since all addresses in the original write-tree 
will eventually find their way to the appropriate leaf of the 
M tree and modify that leaf. 
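The insertion procedure can be sketched in Standard ML as follows. This is a minimal illustration and not the authors' code: the datatypes mem and wtree, the helpers minAddr, maxAddr, join, split, and contents, and the curried argument order of modify are all assumptions made for this example. It assumes the write-tree holds distinct addresses that all occur in M.

```sml
(* Memory tree: leaves hold (address, value); an internal node holds the
   greatest address in its left branch. *)
datatype mem =
    MLeaf of int * int
  | MNode of int * mem * mem

(* Write-tree: leaves hold (address, new value); an internal node caches
   the minimum and maximum address below it. *)
datatype wtree =
    WLeaf of int * int
  | WNode of int * int * wtree * wtree

fun minAddr (WLeaf (a, _)) = a
  | minAddr (WNode (lo, _, _, _)) = lo

fun maxAddr (WLeaf (a, _)) = a
  | maxAddr (WNode (_, hi, _, _)) = hi

(* Rejoin two optional pieces, recomputing the cached bounds. *)
fun join (NONE, b) = b
  | join (a, NONE) = a
  | join (SOME a, SOME b) = SOME (WNode (minAddr a, maxAddr b, a, b))

(* split w key = (addresses <= key, addresses > key); it follows a single
   root-to-leaf path of w. *)
fun split (w as WLeaf (a, _)) key =
      if a <= key then (SOME w, NONE) else (NONE, SOME w)
  | split (w as WNode (lo, hi, l, r)) key =
      if hi <= key then (SOME w, NONE)
      else if lo > key then (NONE, SOME w)
      else if maxAddr l <= key then
        let val (rl, rr) = split r key in (join (SOME l, rl), rr) end
      else
        let val (ll, lr) = split l key in (ll, join (lr, SOME r)) end

fun modify (MLeaf (a, _)) (WLeaf (a', v)) =
      if a = a' then MLeaf (a, v)
      else raise Fail "modify: address not in memory"
  | modify (MLeaf _) (WNode _) =
      raise Fail "modify: several writes reached one leaf"
  | modify (MNode (k, l, r)) w =
      if maxAddr w <= k then MNode (k, modify l w, r)
      else if minAddr w > k then MNode (k, l, modify r w)
      else
        (case split w k of
           (SOME wl, SOME wr) =>
             (* the two calls are independent, so they may run in parallel *)
             MNode (k, modify l wl, modify r wr)
         | _ => raise Fail "modify: split did not separate the requests")

(* Values at the leaves, left to right. *)
fun contents (MLeaf (_, v)) = [v]
  | contents (MNode (_, l, r)) = contents l @ contents r

val m = MNode (1, MNode (0, MLeaf (0, 10), MLeaf (1, 11)),
                  MNode (2, MLeaf (2, 12), MLeaf (3, 13)))
val w = WNode (1, 2, WLeaf (1, 99), WLeaf (2, 77))
val result = contents (modify m w)   (* [10, 99, 77, 13] *)
```

In the example, the two requests straddle the key stored at the root of m, so the write-tree is split once and each half descends a different branch.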

We now consider the total work and depth required.
Splitting W into two trees based on a key can be
implemented in O(log p) work and depth by following down to
the appropriate leaf, splitting along the way. Since M is of
depth lg m, the total depth complexity is therefore bounded
by O(log p log m). To prove the bounds on the work, we
observe that it cannot take more than O(p log p) work to split
the tree into p pieces of size 1 since each split takes O(log p)
work and there are p - 1 of them. This means the total work
needed to split the original write-tree is bounded by O(p log p).
The only other work is the check at each node of the M tree
of whether we have to split or send all values down to one
or the other of the branches. The maximum work done for these
checks is O(p log m) since there can be at most p separate
chains (one per leaf of the write-tree), each of which is at most
as deep as the M tree (O(log m)). The total work is therefore
O(p(log p + log m)) = O(p log m). □
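Collecting the per-step costs stated in the proof gives the bounds of Theorem 3. The summary below is a sketch; the labels W_step and D_step are introduced here for illustration, and the simplifications use p ≤ m, which follows from the assumption p < m.

```latex
\begin{aligned}
W_{\text{step}} &= \underbrace{O(p)}_{\text{registers}}
  + \underbrace{O(p\log m)}_{\text{reads}}
  + \underbrace{O(p\log p)}_{\text{sort}}
  + \underbrace{O(p)}_{\text{combine}}
  + \underbrace{O(p\log m)}_{\text{modify}}
  = O(p\log m),\\
D_{\text{step}} &= O(\log p) + O(\log m) + O(\log^2 p) + O(\log p)
  + O(\log p\,\log m) = O(\log m\,\log p),\\
W &= t\,W_{\text{step}} = O(t\,p\log m),\qquad
D = t\,D_{\text{step}} = O(t\,\log m\,\log p).
\end{aligned}
```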



5 Analyzing Algorithms 

In this section we examine how the model can be used to an- 
alyze algorithms. As examples, we describe parallel versions 
of quicksort and mergesort. These two algorithms illustrate 
some of the techniques necessary for programming efficient 
algorithms in the model. 

We first note that any sorting algorithm that represents
its input as a list requires depth at least proportional to
its input size, since this is the time required just to look at
all the elements. In fact a simple mergesort that makes its
two recursive calls in parallel will match this lower bound
for depth. Deriving parallel algorithms whose depth is
sublinear in the input size requires that the input and output
be represented as trees. This section shows how trees can be
used to derive effective parallel versions of quicksort and
mergesort and analyzes these versions in the PAL model.
The tree representation we will use is given in Figure 6. We
assume that the ordering for sorted sequences is specified by
a left-to-right traversal of the tree.

[Diagram: an unsorted tree and the corresponding sorted tree.]

    datatype 'a Tree =
        Empty
      | Leaf of 'a
      | Node of int * 'a Tree * 'a Tree

Figure 6: Representing sequences as trees. The values are
stored at the leaves and each internal node stores the size of
its subtree (the number of leaves below it).
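As an illustration of this representation, the following Standard ML sketch builds a balanced tree from a list and reads the sequence back with a left-to-right traversal. The helpers size, fromList, and toList are assumptions introduced for this example, and the leaf values are arbitrary.

```sml
datatype 'a Tree =
    Empty
  | Leaf of 'a
  | Node of int * 'a Tree * 'a Tree   (* size, left, right *)

(* Number of leaves; constant work because each Node caches it. *)
fun size Empty = 0
  | size (Leaf _) = 1
  | size (Node (n, _, _)) = n

(* Hypothetical helper: build a balanced tree whose left-to-right leaf
   order is the order of the list. *)
fun fromList [] = Empty
  | fromList [x] = Leaf x
  | fromList xs =
      let
        val half = length xs div 2
        val l = fromList (List.take (xs, half))
        val r = fromList (List.drop (xs, half))
      in
        Node (size l + size r, l, r)
      end

(* Hypothetical helper: the sequence is the left-to-right leaf order. *)
fun toList Empty = []
  | toList (Leaf x) = [x]
  | toList (Node (_, l, r)) = toList l @ toList r

val t = fromList [6, 9, 15, 13, 5, 18, 3]
val n = size t       (* 7 *)
val xs = toList t    (* [6, 9, 15, 13, 5, 18, 3] *)
```

Caching the size in each internal node is what lets later routines find the middle of a sequence without traversing it.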

Parallel Quicksort: The code for our quicksort algorithm is
given in Figure 7. The function qsort_rec returns a sorted
tree, but in general it will not be perfectly balanced, so the
function rebalance rebalances it. The function qsort_rec is
similar to the sequential version of quicksort on lists, except
that elt, select, and append are implemented on trees.
The function elt can be implemented by traversing the tree
down to the appropriate leaf, and for a tree of depth d, this
requires O(d) work and depth (there is no parallelism). The
function select is implemented by calling itself recursively
in parallel on both branches and putting the results back
together. Assuming the function f has constant work and
depth, select on a tree of size n and depth d requires O(n)
work and O(d) depth. We note that the tree returned by
select is generally not going to be balanced, which is why
we do not assume that d = lg n. The append function simply
puts its two arguments together in a tree node and therefore
has constant work and depth.
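The figure does not list elt, so here is one possible Standard ML sketch of it. It assumes the tree type of Figure 6 and 0-based indexing; both the indexing convention and the use of the Subscript exception are choices made for this example.

```sml
datatype 'a Tree =
    Empty
  | Leaf of 'a
  | Node of int * 'a Tree * 'a Tree   (* size, left, right *)

fun size Empty = 0
  | size (Leaf _) = 1
  | size (Node (n, _, _)) = n

(* elt t i: the i-th leaf of t, counting from 0. Each step descends one
   level of the tree, so work and depth are both O(d). *)
fun elt (Leaf x) 0 = x
  | elt (Node (_, l, r)) i =
      if i < size l then elt l i else elt r (i - size l)
  | elt _ _ = raise Subscript

val t = Node (3, Node (2, Leaf 6, Leaf 9), Leaf 15)
val second = elt t 1    (* 9 *)
```

The descent is inherently sequential, which matches the remark that elt offers no parallelism.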

We first present a general theorem that bounds work and 
depth for our quicksort in the expected case for any input 
tree, even if not balanced, and as a corollary give the bounds 
for balanced input. 

Theorem 4 The quicksort algorithm specified in Figure 7
when applied to a tree with n leaves and depth d will
execute in O(n log n) work and O(d log n) depth on the A-PAL
model, both expected case (i.e., average over all possible
inputs of that depth and size).

    (* size, elt, take, and drop are assumed to be defined on trees *)
    fun append Empty b = b
      | append a Empty = a
      | append a b = Node (size a + size b, a, b)

    (* keep the leaves satisfying f; the two recursive calls are independent *)
    fun select f Empty = Empty
      | select f (Leaf x) =
          if f x then Leaf x else Empty
      | select f (Node (_, left, right)) =
          append (select f left) (select f right)

    fun qsort_rec a =
      if size a < 2 then a
      else
        let val pivot = elt a (size a div 2)
            val less = select (fn x => x < pivot) a
            val equal = select (fn x => x = pivot) a
            val greater = select (fn x => x > pivot) a
        in
          append (qsort_rec less)
                 (append equal (qsort_rec greater))
        end

    fun rebalance Empty = Empty
      | rebalance (Leaf x) = Leaf x
      | rebalance a =
          let val half = size a div 2
          in
            append (rebalance (take a half))
                   (rebalance (drop a half))
          end

    fun quicksort a = rebalance (qsort_rec a)

Figure 7: An example diagram of the select function and
the code for the parallel quicksort algorithm.

Proof: We first consider qsort_rec. We note that since the
pivots in quicksort will not perfectly split the data, some
recursive paths will be longer than others. We call the longest
path of recursive calls for qsort_rec on a particular input
the recursion depth for that input. We note that the worst
case recursion depth is O(n) and that fewer than 1 out of n
of the possible inputs will lead to a recursion depth greater
than k log n, for some constant k [34]. To determine the total
computational depth of qsort_rec, we need to consider the
computational depth along the longest path. We claim that this
computational depth is at most O(d) times the recursion depth
since each node along the recursion tree will require at most
O(d) depth. This is because elt and select will run in O(d)
depth (see footnote 2). Since a fraction of only 1/n of the
inputs will have a recursion depth greater than O(log n), and
these cases will have recursion depth at most O(n), the average
(expected case) computation depth of qsort_rec is

$$
D(n) = O\left(d\left(\log n + \tfrac{1}{n}\, n\right)\right) = O(d \log n).
$$

To see that the work is expected to be O(n log n), we simply
note that all steps do no more than a constant fraction more
work than a list-based sequential implementation.
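The expectation step can be spelled out as follows. The symbols R (the recursion depth, as a random variable over inputs) and c (the constant in the logarithmic bound that holds for all but a 1/n fraction of inputs) are introduced here for illustration.

```latex
\begin{aligned}
E[R] &\le \Pr[R \le c\log n]\cdot c\log n \;+\; \Pr[R > c\log n]\cdot O(n)\\
     &\le c\log n + \tfrac{1}{n}\cdot O(n) = O(\log n),\\
D(n) &= O(d)\cdot E[R] = O(d\log n).
\end{aligned}
```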

We now briefly consider the routine rebalance. We note
that the depth of the tree returned by qsort_rec is at most a
constant times the recursion depth. The function rebalance
is implemented by splitting the tree along the path that
separates the tree into two equal size pieces (or off by 1),
recursively calling itself on the two parts, and appending the
results. We claim that for a tree of size n and depth d it will
run with O(n log n) work and O(d log n) depth (worst case).
Given the above bounds on the recursion depth, this gives
an expected depth of O(log^2 n). □
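The splitting used by rebalance can be sketched in Standard ML as follows. The functions take and drop are named in Figure 7 but not listed there, so the definitions below are assumptions of this example, as are the depth helper and the size guard in rebalance. Each of take and drop follows a single root-to-leaf path.

```sml
datatype 'a Tree =
    Empty
  | Leaf of 'a
  | Node of int * 'a Tree * 'a Tree   (* size, left, right *)

fun size Empty = 0
  | size (Leaf _) = 1
  | size (Node (n, _, _)) = n

fun append Empty b = b
  | append a Empty = a
  | append a b = Node (size a + size b, a, b)

(* take t k: the first k leaves of t, in order. *)
fun take Empty _ = Empty
  | take (Leaf x) k = if k >= 1 then Leaf x else Empty
  | take (t as Node (n, l, r)) k =
      if k <= 0 then Empty
      else if k >= n then t
      else if k <= size l then take l k
      else append l (take r (k - size l))

(* drop t k: everything except the first k leaves of t. *)
fun drop Empty _ = Empty
  | drop (Leaf x) k = if k >= 1 then Empty else Leaf x
  | drop (t as Node (n, l, r)) k =
      if k <= 0 then t
      else if k >= n then Empty
      else if k >= size l then drop r (k - size l)
      else append (drop l k) r

(* The two recursive calls are independent of each other. *)
fun rebalance a =
  if size a < 2 then a
  else
    let val half = size a div 2
    in append (rebalance (take a half)) (rebalance (drop a half)) end

fun depth (Node (_, l, r)) = 1 + Int.max (depth l, depth r)
  | depth _ = 0

(* A right-leaning chain of 8 leaves, then its balanced version. *)
val chain = foldr (fn (x, t) => append (Leaf x) t) Empty
                  (List.tabulate (8, fn i => i))
val before = depth chain               (* 7 *)
val after = depth (rebalance chain)    (* 3 *)
```

The chain stands in for the kind of unbalanced tree that qsort_rec can return; rebalancing it restores logarithmic depth while preserving the left-to-right order of the leaves.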

Corollary 2 The quicksort algorithm specified in Figure 7
when applied to a balanced tree with n leaves will execute in
O(n log n) work and O(log^2 n) depth on the A-PAL model,
both expected case.

Parallel Mergesort: We first consider the problem of merging
two sorted trees. We use n to refer to the sum of the sizes
of the two trees. We assume that each internal node of
the input trees contains the maximum value of its
descendants, as well as its size. This is clearly easy to generate
in O(n) work and O(log n) depth. The main component of
the parallel algorithm is a routine select_kth which, given
two ordered trees a and b, returns the kth smallest value
from the combination of the two sequences (see Figure 8).
It is implemented using a dual binary search in which we go
down a branch from one of the two sequences on each step,
using the maximal element at each node for navigation.
Assuming the depths of the two trees are d_a and d_b, the work
and depth complexity of this routine is O(d_a + d_b).

To merge two trees, we use select_kth to find their
combined median element. We then select the elements less and
greater, respectively, than the median for each tree with
the functions take_less and drop_less. These can be
implemented with O(log n) work and depth since the trees are
sorted and balanced (it just requires going down a tree
splitting along the way). Recursively merging the two trees of
lesser elements and the two trees of greater elements gives us
two sorted trees which are guaranteed to be the same size (or
off by one) by construction. So, joining them under a new
node produces a balanced sorted tree. As a whole, merging
in this manner takes O(n) work and O(log^2 n) depth since
we recurse for the lg n depth of the trees.

Footnote 2: Note that although select does not return balanced
trees, it will never return a tree with depth greater than the
original tree, which has depth d.

    datatype 'a Tree =
        Empty
      | Leaf of 'a
      | Node of int * 'a * 'a Tree * 'a Tree  (* size, max, left, right *)

    (* append, insert, take, drop, take_less, and drop_less are
       assumed to be defined on this tree type *)
    fun size Empty = 0
      | size (Leaf _) = 1
      | size (Node (n, _, _, _)) = n

    fun maxOf (Leaf v) = v
      | maxOf (Node (_, v, _, _)) = v
      | maxOf Empty = raise Fail "maxOf: empty tree"

    (* kth smallest (counting from 0) of two sorted, nonempty trees *)
    fun select_kth k (Leaf v1) (Leaf v2) =
          if v2 > v1 then (if k = 0 then v1 else v2)
          else (if k = 0 then v2 else v1)
      | select_kth k (Leaf v1) (Node (_, _, l2, r2)) =
          if maxOf l2 > v1
          then if k > size l2
               then select_kth (k - size l2) (Leaf v1) r2
               else select_kth k (Leaf v1) l2
          else if size l2 > k
               then select_kth k (Leaf v1) l2
               else select_kth (k - size l2) (Leaf v1) r2
      | select_kth k (a as Node _) (b as Leaf _) = select_kth k b a
      | select_kth k (a as Node (_, _, l1, r1))
                     (b as Node (_, _, l2, r2)) =
          if maxOf l2 > maxOf l1
          then if k < size l1 + size l2
               then select_kth k a l2
               else select_kth (k - size l1) r1 b
          else if k < size l1 + size l2
               then select_kth k l1 b
               else select_kth (k - size l2) a r2
      | select_kth _ _ _ = raise Fail "select_kth: empty tree"

    (* Empty cases added because take_less and drop_less may return
       Empty; keys are assumed distinct so that both halves shrink *)
    fun merge Empty b = b
      | merge a Empty = a
      | merge (Leaf x) b = insert x b
      | merge a (Leaf y) = insert y a
      | merge a b =
          let val k = (size a + size b) div 2
              val median = select_kth k a b
          in
            append (merge (take_less a median)
                          (take_less b median))
                   (merge (drop_less a median)
                          (drop_less b median))
          end

    fun merge_sort a =
      if size a < 2 then a
      else
        let val half = size a div 2
        in
          merge (merge_sort (take a half))
                (merge_sort (drop a half))
        end

Figure 8: Code of the parallel mergesort algorithm.
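The merge bounds can be read off from recurrences of the following form. This is a sketch under stated assumptions: the inputs are balanced with total size n, so that finding the median and splitting each tree around it costs O(log n) per level, and the two recursive merges run in parallel on inputs of total size about n/2. The names W_merge and D_merge follow the notation used in the proof of Theorem 5.

```latex
\begin{aligned}
W_{\text{merge}}(n) &= 2\,W_{\text{merge}}(n/2) + O(\log n) = O(n),\\
D_{\text{merge}}(n) &= D_{\text{merge}}(n/2) + O(\log n) = O(\log^2 n).
\end{aligned}
```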

Theorem 5 The mergesort algorithm specified in Figure 8
when applied to a balanced tree with n leaves will execute in
O(n log n) work and O(log^3 n) depth on the A-PAL model.

Proof: We can write the following recurrences for work and
depth:

$$
\begin{aligned}
W(n) &= 2W(n/2) + W_{\text{merge}}(n) \\
     &= 2W(n/2) + O(n) \\
     &= O(n \log n) \\
D(n) &= D(n/2) + D_{\text{merge}}(n) \\
     &= D(n/2) + O(\log^2 n) \\
     &= O(\log^3 n)
\end{aligned}
$$

□
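Unrolling the recurrences shows where the bounds come from. The following is a sketch that assumes n is a power of two and writes c and c' for the constants hidden in the merge depth and merge work bounds (both symbols are introduced here for illustration):

```latex
\begin{aligned}
D(n) &\le \sum_{i=0}^{\lg n - 1} c\,\lg^2\!\left(n/2^i\right) + O(1)
      = c\sum_{j=1}^{\lg n} j^2 + O(1)
      = c\,\frac{\lg n\,(\lg n + 1)(2\lg n + 1)}{6} + O(1)
      = O(\log^3 n),\\
W(n) &\le \sum_{i=0}^{\lg n - 1} 2^i\cdot c'\,\frac{n}{2^i} + O(n)
      = c'\,n\lg n + O(n) = O(n\log n).
\end{aligned}
```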
This version of mergesort is not as efficient as the quicksort
previously described. However, if merging uses n/lg n
splitters, rather than just the median, the depth
complexities of merging and mergesort can each be improved by a
factor of log n [7].

6 Related Work 

Several researchers have used cost-augmented semantics for 
automatic time analysis of serial programs [3, 38, 39, 45]. 
This work was concerned with serial running time, and since 
they were primarily interested in automatically analyzing 
programs rather than defining complexity, they each altered 
the semantics of functions to simplify such analysis. Fur- 
thermore, none related their complexity models to more tra- 
ditional machine models, although since the languages are 
serial this should not be hard. 

Roe [36, 37] and Zimmermann [46, 47] both studied
profiling semantics for parallel languages. Roe formally defined
a profiling semantics for an extended λ-calculus with lenient
evaluation. In his semantics, the two subexpressions of a
special let expression plet x = e1 in e2 evaluate in parallel
such that the evaluation of an occurrence of x in e2 is
delayed until its value is available. To define when this is
the case, he augmented the standard denotational semantics
with the time that each expression begins and ends
evaluation. He did not show any complexity bounds resulting
from his definition or relate this model to any other.
Zimmermann introduced a profiling semantics for a data-parallel
language for the purpose of automatically analyzing PRAM
algorithms. The language therefore almost directly modeled
the PRAM by adding a set of PRAM-like primitive
operations. Complexity was measured in terms of time and
number of processors, as it is measured for the PRAM. It was
not shown, however, whether the model exactly modeled the
PRAM. In particular since it is not known until execution
how many processors are needed, it is not clear whether the
scheduling could be done on the fly.

Hudak and Anderson [19] suggest modeling parallelism in
functional languages using an extended operational
semantics based on partially ordered multisets (pomsets). The
semantics can be thought of as keeping a trace of the
computation as a partial order specifying what had to be computed
before what else. Although significantly more complicated,
their call-by-value semantics are related to the A-PAL model
in the following way. The work in the A-PAL model is within
a constant factor of the number of elements in the pomset,
and the number of steps is within a constant factor of the
longest chain in the pomset. They did not relate their model to
other models of parallelism or describe how it would affect
algorithms.

Previous work on formally relating language-based models
(languages with cost-augmented semantics) to machine
models is sparse. Jones [21] related the time-augmented
semantics of a simple while-loop language to that of an
equivalent machine language in order to study the effect of
constant factors in time complexity. Seidl and Wilhelm [40]
provide complexity bounds for an implementation of graph
reduction on the PRAM. However, their implementation only
considers a single step and requires that you know which graph
nodes to execute in parallel in that step and that the graph
has constant in-degree. Under these conditions they show
how to process n nodes in O(n/p + p log p) time (which is
a factor of p worse than our bounds in the second term;
see Lemma 4). There have also been several experimental
studies of how much parallelism is available in sequential
functional languages [11, 8, 10].

The work-step paradigm has been used for many years 
for informally describing parallel algorithms [42, 22]. It was 
first included in a formal model by Blelloch in the VRAM [5]. 
NESL [6], a data-parallel functional language, includes com- 
plexity measures based on work and steps and has been 
used for describing and teaching parallel algorithms. Skil- 
licorn [43] also introduced cost measures specified in terms 
of work and steps for a data-parallel language based on the 
Bird-Meertens formalism. In both cases the languages were 
not based on the pure λ-calculus but instead included array
primitives. Also, neither formally showed the relationship
of their models to machine models. Part of the motivation
of the work described in this paper was to formalize the 
mapping of complexity to machine models and to see how 
much parallelism is available without adding data-parallel 
primitives. 

Dornic et al. [13] and Reistad and Gifford [35] explore
adding time information to a functional language type sys- 
tem. But for type inference to terminate, only special forms 
of recursion can be used, such as those of the Bird-Meertens 
formalism. 

There has been much work on comparing machine mod- 
els within traditional complexity theory. The most closely 
related is that of Ben-Amram and Galil [2], who show that 
a pointer machine incurs logarithmic overhead to simulate a 
RAM. The pointer machine [24, 44] is similar to the SECD 
machine in that it addresses memory only through point- 
ers, but it lacks direct support for implementing higher- 
order functions. We borrow from them the parameteriza- 
tion of models over incompressible data types and opera- 
tions. Paige [29] also compares models similar to those used 
by Ben-Amram and Galil. 

Goodrich and Kosaraju [18] introduced a parallel pointer 
machine (PPM), but this is quite different from our model 
since it assumes a fixed number of processors and allows side 
effecting of pointers. Another parallel version of the SECD 
machine was introduced by Abramsky and Sykes [1], but
their Secd-m machine was non-deterministic and based on 
the fair merge. 

7 Conclusions 

This paper has discussed a complexity model based on the 
λ-calculus and shown various simulation results. A goal of
this work is to bring a closer tie between parallel algorithms 
and functional languages. We believe that language-based 
complexity models, such as the ones suggested in this paper, 
could be a useful way for describing and thinking about 
parallel algorithms directly, rather than always needing to 
translate to a machine model. 

This paper leaves several open questions, including:

• We mentioned that a call-by-speculation implementa- 
tion of normal-order evaluation might allow for im- 
proved depth bounds for various problems. In partic- 
ular it allows for pipelined execution. Does this help, 
and on what problems? 

• Is it possible to sort within d = o(log^2 n), and w =
O(n log n)?

• Can the bounds for simulating the A-PAL on a PRAM
be improved? The bounds for the butterfly network 
are tight. 

• Our simulations are memory inefficient. Can good 
bounds be placed on the use of memory? 

• Because it lacks random-access, can the A-PAL model
be simulated more efficiently than the PRAM on
machines that have less powerful communication (e.g.,
fixed-topology networks, parallel I/O models, or the
LogP model [12]), and can the complexity model be
augmented to capture the notion of locality for these
machines?